{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "bguORR-uVtyk"
      },
      "outputs": [],
      "source": [
        "# Copyright 2021 Google LLC\n",
        "#\n",
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "#     https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "f1KuZ_LBcHee"
      },
      "source": [
        "# Feedback or issues?\n",
        "\n",
        "For any feedback or questions, please open an [issue](https://github.com/googleapis/python-aiplatform/issues)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "A4QHhG05cJD9"
      },
      "source": [
        "# Vertex SDK for Python: AutoML Text Extraction Example\n",
        "To use this Jupyter notebook, copy the notebook to a Google Cloud Notebooks instance with Tensorflow installed and open it. You can run each step, or cell, and see its results. To run a cell, use Shift+Enter. Jupyter automatically displays the return value of the last line in each cell. For more information about running notebooks in Google Cloud Notebook, see the [Google Cloud Notebook guide](https://cloud.google.com/vertex-ai/docs/general/notebooks).\n",
        "\n",
        "\n",
        "This notebook demonstrate how to create an AutoML Text Extraction Model, with a Vertex AI text dataset, and how to serve the model for online prediction.\n",
        "\n",
        "Note: you may incur charges for training, prediction, storage or usage of other GCP products in connection with testing this SDK."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "lld3eeJUs5yM"
      },
      "source": [
        "### Install Vertex SDK for Python\n",
        "\n",
        "\n",
        "After the SDK installation the kernel will be automatically restarted."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "sBfZtR4X1Dr_"
      },
      "outputs": [],
      "source": [
        "!pip3 uninstall -y google-cloud-aiplatform\n",
        "!pip3 install google-cloud-aiplatform\n",
        "import IPython\n",
        "\n",
        "app = IPython.Application.instance()\n",
        "app.kernel.do_shutdown(True)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "kBFvlCFh5Yij"
      },
      "outputs": [],
      "source": [
        "import sys\n",
        "\n",
        "if \"google.colab\" in sys.modules:\n",
        "    from google.colab import auth\n",
        "\n",
        "    auth.authenticate_user()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Yz_rkDIteP5M"
      },
      "source": [
        "### Enter Your Project and GCS Bucket\n",
        "\n",
        "Enter your Project Id in the cell below. Then run the cell to make sure the Cloud SDK uses the right project for all the commands in this notebook."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "iqSQT6Z6bekX"
      },
      "outputs": [],
      "source": [
        "MY_PROJECT = \"YOUR PROJECT ID\"\n",
        "MY_STAGING_BUCKET = \"gs://YOUR BUCKET\"  # bucket should be in same region as ucaip"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "rk43VP_IqcTE"
      },
      "source": [
        "## Initialize Vertex SDK for Python\n",
        "\n",
        "Initialize the *client* for Vertex AI."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "VCiC9gBWqcTF"
      },
      "outputs": [],
      "source": [
        "from google.cloud import aiplatform\n",
        "\n",
        "aiplatform.init(project=MY_PROJECT, staging_bucket=MY_STAGING_BUCKET)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "35QVNhACqcTJ"
      },
      "source": [
        "## Create a Dataset on Vertex AI\n",
        "We will now create a Vertex AI text dataset using the previously prepared jsonl files. "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XGKZ3bdyTMcd"
      },
      "source": [
        "### The biomedical dataset\n",
        "To create an entity extraction model, use a corpus of biomedical research abstracts that mention hundreds of diseases and concepts. The resulting model identifies these medical entities in other documents.\n",
        "\n",
        "The goal of the corpus is to advance the understanding of the causes of happiness through text-based reflection.\n",
        "\n",
        "Please reference [AutoML Documentation](https://cloud.google.com/natural-language/automl/docs/quickstart#model_objectives) for more information."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "KeNtSVjRxVCC"
      },
      "outputs": [],
      "source": [
        "# Text Extraction\n",
        "IMPORT_FILE = \"gs://ucaip-test-us-central1/dataset/ucaip_ten_dataset.jsonl\""
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "4OfCqaYRqcTJ"
      },
      "outputs": [],
      "source": [
        "ds = aiplatform.TextDataset.create(\n",
        "    display_name=\"text-extraction\",\n",
        "    gcs_source=[IMPORT_FILE],\n",
        "    import_schema_uri=aiplatform.schema.dataset.ioformat.text.extraction,\n",
        ")\n",
        "\n",
        "ds.resource_name"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6-bBqipfqcTS"
      },
      "source": [
        "## Launch a Training Job and Create a Model on Vertex AI"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "aA41rT_mb-rV"
      },
      "outputs": [],
      "source": [
        "job = aiplatform.AutoMLTextTrainingJob(\n",
        "    display_name=\"text-extraction\", prediction_type=\"extraction\"\n",
        ")\n",
        "\n",
        "# This will take around an hour to run\n",
        "model = job.run(\n",
        "    dataset=ds,\n",
        "    training_fraction_split=0.6,\n",
        "    validation_fraction_split=0.2,\n",
        "    test_fraction_split=0.2,\n",
        "    model_display_name=\"text-extraction\",\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "5vhDsMJNqcTW"
      },
      "source": [
        "# Deploy Model"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Y9GH72wWqcTX"
      },
      "outputs": [],
      "source": [
        "endpoint = model.deploy()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nIw1ifPuqcTb"
      },
      "source": [
        "# Predict on Endpoint"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "3k6-rSZPqcTc"
      },
      "outputs": [],
      "source": [
        "input_text = \"\"\"\n",
        "Phenotypic variation including retinitis pigmentosa, pattern dystrophy, and fundus flavimaculatus in a single family with a deletion of codon 153 or 154 of the peripherin/RDS gene.\\tBACKGROUND AND OBJECTIVES  Mutations of the peripherin / RDS gene have been reported in autosomal dominant retinitis pigmentosa , pattern macular dystrophy , and retinitis punctata albescens . We report herein the occurrence of three separate phenotypes within a single family with a novel 3-base pair deletion of codon 153 or 154 of the peripherin / RDS gene . DESIGN  Case reports with clinical features , fluorescein angiography , kinetic perimetry , electrophysiological studies , and molecular genetics . SETTING  University medical centers . PATIENTS  A 75-year-old woman , her two daughters ( aged 44 and 50 years ) , and her 49-year-old son were screened for peripherin / RDS mutations because of the presence of multiple phenotypes within the same family . RESULTS  The mother presented at age 63 years with a profoundly abnormal electroretinogram ( ERG ) and adult-onset retinitis pigmentosa that progressed dramatically over 12 years , with marked loss of peripheral visual field . One daughter developed pattern macular dystrophy at age 31 years . At age 44 years , her ERG was moderately abnormal but her clinical disease was limited to the macula . Another daughter presented at age 42 years with macular degeneration and over 10 years developed the clinical picture of fundus flavimaculatus . Her peripheral visual field was preserved but her ERG was moderately abnormal . The son had onset of macular degeneration at age 44 years . Pericentral scotomas were present and the ERG was markedly abnormal . Fluorescein angiography revealed punctate pigment epithelial transmission defects . CONCLUSIONS  A 3-base pair deletion of codon 153 or 154 of the peripherin / RDS gene can produce clinically disparate phenotypes even within the same family\n",
        "Splicing defects in the ataxia-telangiectasia gene, ATM: underlying mutations and consequences.\\tMutations resulting in defective splicing constitute a significant proportion ( 30 / 62 [ 48 % ] ) of a new series of mutations in the ATM gene in patients with ataxia-telangiectasia ( AT ) that were detected by the protein-truncation assay followed by sequence analysis of genomic DNA . Fewer than half of the splicing mutations involved the canonical AG splice-acceptor site or GT splice-donor site . A higher percentage of mutations occurred at less stringently conserved sites , including silent mutations of the last nucleotide of exons , mutations in nucleotides other than the conserved AG and GT in the consensus splice sites , and creation of splice-acceptor or splice-donor sites in either introns or exons . These splicing mutations led to a variety of consequences , including exon skipping and , to a lesser degree , intron retention , activation of cryptic splice sites , or creation of new splice sites . In addition , 5 of 12 nonsense mutations and 1 missense mutation were associated with deletion in the cDNA of the exons in which the mutations occurred . No ATM protein was detected by western blotting in any AT cell line in which splicing mutations were identified . Several cases of exon skipping in both normal controls and patients for whom no underlying defect could be found in genomic DNA were also observed , suggesting caution in the interpretation of exon deletions observed in ATM cDNA when there is no accompanying identification of genomic mutations .\n",
        "\"\"\"\n",
        "\n",
        "instances_list = [{\"content\": input_text}]\n",
        "\n",
        "prediction = endpoint.predict(instances_list)\n",
        "prediction"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "RXpchK0oWqWq"
      },
      "outputs": [],
      "source": [
        "prediction_instance = prediction.predictions[0]\n",
        "\n",
        "extractions = zip(\n",
        "    prediction_instance[\"ids\"],\n",
        "    prediction_instance[\"textSegmentStartOffsets\"],\n",
        "    prediction_instance[\"textSegmentEndOffsets\"],\n",
        "    prediction_instance[\"confidences\"],\n",
        "    prediction_instance[\"displayNames\"],\n",
        ")\n",
        "\n",
        "for id, start, end, confidence, display_name in extractions:\n",
        "    print(\n",
        "        f\"{id}: '{input_text[int(start):int(end)]}' predicted as '{display_name}'' with confidence {confidence}\"\n",
        "    )"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "collapsed_sections": [],
      "name": "AI_Platform_(Unified)_SDK_AutoML_Text_Extraction_Training.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
